1 Introduction

Bayesian ANalysis of Differential Localisation Experiments (BANDLE) is an integrative semi-supervised functional mixture model, developed by Crook et al. (2021), to obtain the probability of a protein being differentially localised between two conditions.

In this vignette we walk users through how to install and use the R (R Development Core Team 2011) Bioconductor (Gentleman et al. 2004) bandle package by simulating a well-defined differential localisation experiment from spatial proteomics data from the pRolocdata package (Gatto et al. 2014).

The BANDLE method uses posterior Bayesian computations performed using Markov-chain Monte-Carlo (MCMC) and thus uncertainty estimates are available (Gilks, Richardson, and Spiegelhalter 1995). Throughout this vignette we use the term differentially localised to pertain to proteins which are assigned to different sub-cellular localisations between two conditions. One of the main outputs of BANDLE is the probability that a protein is differentially localised between two conditions.

2 The data

In this vignette and Crook et al. (2021), the main data source that we use to study differential protein sub-cellular localisation are data from high-throughput mass spectrometry-based experiments. The data from these types of experiments traditionally yield a matrix of measurements wherein we have, for example, PSMs, peptides or proteins along the rows, and samples/channels/fractions along the columns. The bandle package uses the MSnSet class as implemented in the Bioconductor MSnbase package and thus requires users to import and store their data as a MSnSet instance. For more details on how to create a MSnSet see the relevant vignettes in pRoloc. There is also additional information and examples in the pRoloc sister package. The pRolocdata experiment data package is a good starting place to look for test data. This data package contains tens of quantitative proteomics experiments, stored as MSnSets. In the next section we load some real data from this package as a use-case to demonstrate how to run the bandle package.

2.1 Example data: spatialtemporal proteomic profiling of a THP-1 cell line

The data used in this vignette has been published in Mulvey et al. (2021) and is stored as MSnSet instances in pRolocdata package.

Briefly, Mulvey et al. (2021) performed triplicate hyperLOPIT experiments on THP-1 human leukamia cells where the samples were analysed and collected (1) when cells were unstimulated and then (2) following 12 hours stimulation with LPS (12h-LPS).

In the following code chunk we load 4 of the datasets from the study: 2 replicates of the unstimulated and 2 replicates of the 12h-LPS stimulated samples.

library("pRolocdata")
data("thpLOPIT_unstimulated_rep1_mulvey2021")
data("thpLOPIT_unstimulated_rep3_mulvey2021")
data("thpLOPIT_lps_rep1_mulvey2021")
data("thpLOPIT_lps_rep3_mulvey2021")

By typing the names of the datasets we get a dataMSnSet` data summary. For example,

thpLOPIT_unstimulated_rep1_mulvey2021
## MSnSet (storageMode: lockedEnvironment)
## assayData: 5107 features, 20 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: unstim_rep1_set1_126_cyto unstim_rep1_set1_127N_F1.4 ...
##     unstim_rep1_set2_131_F24 (20 total)
##   varLabels: Tag Treatment ... Fraction (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: A0AVT1 A0FGR8-2 ... Q9Y6Y8 (5107 total)
##   fvarLabels: Checked_unst.r1.s1 Confidence_unst.r1.s1 ... markers (107
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:48 2021. 
## Normalised to sum of intensities. 
##  MSnbase version: 2.14.2
thpLOPIT_lps_rep1_mulvey2021
## MSnSet (storageMode: lockedEnvironment)
## assayData: 4879 features, 20 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: lps_rep1_set1_126_cyto lps_rep1_set1_127N_F1.4 ...
##     lps_rep1_set2_131_F25 (20 total)
##   varLabels: Tag Treatment ... Fraction (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: A0A0B4J2F0 A0AVT1 ... Q9Y6Y8 (4879 total)
##   fvarLabels: Checked_lps.r1.s1 Confidence_lps.r1.s1 ... markers (107
##     total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Loaded on Tue Jan 12 14:46:57 2021. 
## Normalised to sum of intensities. 
##  MSnbase version: 2.14.2

We see that the datasets thpLOPIT_unstimulated_rep1_mulvey2021 and thpLOPIT_lps_rep1_mulvey2021 contain 5107 and 4879 proteins respectively, across 20 channels. The data is accessed through different slots of the MSnSet (see str(thpLOPIT_unstimulated_rep1_mulvey2021) for all available slots). The 3 main slots which are used are those the contain the quantitation data, features i.e. PSM/peptide/protein information, and sample information, and these can be accessed using the functions exprs, fData, and pData, respectively.

To run bandle there are a few minimal requirements that the data must fulfil. Data are required to have - the same number of channels across conditions and replicates - the same proteins across conditons and replicates - data must be a list of MSnSet instances

If we use the dim function we see that the datasets we have loaded have the same number of channels but a different number of proteins per experiment.

We use the function commonFeatureNames to extract proteins that are common across all replicates. This function has a nice side effect which is that it also wraps the data into a list, ready for input into bandle.

thplopit <- commonFeatureNames(c(thpLOPIT_unstimulated_rep1_mulvey2021,  ## unstimulated rep
                                 thpLOPIT_unstimulated_rep3_mulvey2021,  ## unstimulated rep
                                 thpLOPIT_lps_rep1_mulvey2021,           ## 12h-LPS rep
                                 thpLOPIT_lps_rep3_mulvey2021))          ## 12h-LPS rep
## 3727 features in common

We now have our list of MSnSets

thplopit
## Instance of class 'MSnSetList' containig 4 objects.

We can visualise the data using the plot2D function from pRoloc

## create a character vector of title names for the plots
plot_id <- c("Unstimulated 1st rep", "Unstimulated 2nd rep",
             "12h-LPS 1st rep", "12h-LPS 2nd rep")

## plot the data
par(mfrow = c(2,2))
for (i in seq(thplopit))
    plot2D(thplopit[[i]], main = plot_id[i])
addLegend(thplopit[[4]], where = "topleft", cex = .75)

3 Running the bandle function

The main function of the bandle package is bandle, this uses a complex model to analyse the data. Markov-Chain Monte-Carlo (MCMC) is used to sample the posterior distribution of parameters and latent variables. From which statistics of interest can be computed. Here we only run a few iterations for brevity but typically one needs to run thousands of iterations to ensure convergence, as well as multiple parallel chains.

First, we need to fit non-parametric regression functions to the markers profiles, upon which we place our analysis. This uses Gaussian processes. We use the fitGPmaternPC function and the fitting uses some default penalised complexity priors (see ?fitGP), which work well. However, these can be altered, which is demonstrated in the next code chunk

par(mfrow = c(3,4))
gpParams <- lapply(thplopit, function(x) fitGPmaternPC(x))

We apply the fitGPmaternPC function on every MSnSet experiment by calling lapply over the thplopit list of data. The output of fitGPmaternPC returns a list of posterior predictive means and standard deviations. As well as MAP hyperparamters for the GP. As a side effect the posterior predictive distributions are overlayed with markers protein profiles for each subcellular class.

The prior needs to form a K*3 matrix (where K is the number of subcellular classes in the data), and 3 for (1) the prior, (2) length-scale amplitude and (3) standard deviation parameters (see hyppar in ?fitGP). Increasing these values, increases the shrinkage. For more details see the manuscript by Crook et al. (2021).

K <- length(getMarkerClasses(thplopit[[1]], fcol = "markers"))
pc_prior <- matrix(NA, ncol = 3, K)
pc_prior[seq.int(1:K), ] <- matrix(rep(c(10, 60, 250),
                                       each = K), ncol = 3)

Now we have generated these complexity priors we can pass them as an argument to the fitGPmaternPC function. For example,

par(mfrow = c(3,4))
gpParams <- lapply(thplopit,
                   function(x) fitGPmaternPC(x, hyppar = pc_prior))

By looking at the plot of posterior predictives we can see the GP fit looks sensible.

4 Setting the prior on the weights

The next step is to set up the matrix Dirichlet prior on the mixing weights. These weights are defined across datasets so these are slightly different to mixture weights in usual mixture models. The \((i,j)^{th}\) entry is the prior probability that a protein localises to organelle \(i\) in the control and \(j\) in the treatment. This mean that off-diagonal terms have a different interpretation to diagonal terms. Since we expect relocalisation to be rare, off-diagonal terms should be small. The following functions help set up the priors and how to interpret them. The parameter q allow us to check the prior probability that more than q differential localisations are expected.

set.seed(1)
dirPrior = diag(rep(1, K)) + matrix(0.0005, nrow = K, ncol = K)
predDirPrior <- prior_pred_dir(object = thplopit[[1]],
                               dirPrior = dirPrior,
                               q = 15)

The mean number of relocalisation is small:

predDirPrior$meannotAlloc
## [1] 0.2308647

The prior probability that more than q differential localisations are expected is small

predDirPrior$tailnotAlloc
## [1] 6e-04

The full prior predictive can be visualised as histogram. The prior probability that proteins are allocated to different components between datasets concentrates around 0.

hist(predDirPrior$priornotAlloc, col = getStockcol()[1])

5 The bandle function

We are now ready to run the main bandle function. Remember to carefully select the datasets and replicates that define the control and treatment. Here for convience of the vignette we only run 2 of the 3 replicates and furthermore, iterations are typically \(1000\)s. Again, a few are run here for the convenience of the vignette run-time.

control <- list(thplopit[[1]], thplopit[[2]])
treatment <- list(thplopit[[3]], thplopit[[4]])

bandleres <- bandle(objectCond1 = control,
                    objectCond2 = treatment,
                    numIter = 50, # usually 10,000
                    burnin = 5, # usually 5,000
                    thin = 1, # usually 20
                    gpParams = gpParams,
                    pcPrior = pc_prior,
                    numChains = 1, # usually >=4
                    dirPrior = dirPrior)
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |==========                                                            |  14%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |=============                                                         |  18%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |===============                                                       |  22%
  |                                                                            
  |=================                                                     |  24%
  |                                                                            
  |==================                                                    |  26%
  |                                                                            
  |====================                                                  |  28%
  |                                                                            
  |=====================                                                 |  30%
  |                                                                            
  |======================                                                |  32%
  |                                                                            
  |========================                                              |  34%
  |                                                                            
  |=========================                                             |  36%
  |                                                                            
  |===========================                                           |  38%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |===============================                                       |  44%
  |                                                                            
  |================================                                      |  46%
  |                                                                            
  |==================================                                    |  48%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================                                  |  52%
  |                                                                            
  |======================================                                |  54%
  |                                                                            
  |=======================================                               |  56%
  |                                                                            
  |=========================================                             |  58%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |===========================================                           |  62%
  |                                                                            
  |=============================================                         |  64%
  |                                                                            
  |==============================================                        |  66%
  |                                                                            
  |================================================                      |  68%
  |                                                                            
  |=================================================                     |  70%
  |                                                                            
  |==================================================                    |  72%
  |                                                                            
  |====================================================                  |  74%
  |                                                                            
  |=====================================================                 |  76%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |=========================================================             |  82%
  |                                                                            
  |===========================================================           |  84%
  |                                                                            
  |============================================================          |  86%
  |                                                                            
  |==============================================================        |  88%
  |                                                                            
  |===============================================================       |  90%
  |                                                                            
  |================================================================      |  92%
  |                                                                            
  |==================================================================    |  94%
  |                                                                            
  |===================================================================   |  96%

The bandle function generates an object of class bandleParams. The show method indicates the number of parallel chains that were run, this should typically be greater than 4.

bandleres
## Object of class "bandleParams"
## Method: bandle 
## Number of chains: 1

6 The summary slot

Currently, the summary slots of the bandleres object are empty. The summaries function accesses them.

summaries(bandleres)
## [[1]]
## An object of class "bandleSummary"
## Slot "posteriorEstimates":
## DataFrame with 0 rows and 0 columns
## 
## Slot "diagnostics":
## <0 x 0 matrix>
## 
## Slot "bandle.joint":
## <0 x 0 matrix>
## 
## 
## [[2]]
## An object of class "bandleSummary"
## Slot "posteriorEstimates":
## DataFrame with 0 rows and 0 columns
## 
## Slot "diagnostics":
## <0 x 0 matrix>
## 
## Slot "bandle.joint":
## <0 x 0 matrix>

These can be populated as follows

bandleres <- bandleProcess(bandleres)

These slot have now been populated

summary(summaries(bandleres))
##      Length Class         Mode
## [1,] 1      bandleSummary S4  
## [2,] 1      bandleSummary S4

The posteriorEstimates slot gives posterior quantities of interest for different proteins. The object is of length 2, 1 for control and 1 for treatment.

length(summaries(bandleres))
## [1] 2

One quantity of interest is the protein allocations, which we can plot in a barplot. We create a new object from populating the bandleres objects using the summaries function.

par(mfrow = c(1, 2), oma = c(6,2,2,2))

pe1 <- summaries(bandleres)[[1]]@posteriorEstimates
barplot(table(pe1$bandle.allocation), col = getStockcol()[2], las = 2, main = "Protein allocation: control")

pe2 <- summaries(bandleres)[[2]]@posteriorEstimates
barplot(table(pe2$bandle.allocation), col = getStockcol()[2], las = 2, main = "Protein allocation: treatment")

The bar plot tells us for this case study that the majority of unlabelled proteins are allocated to the nucleus (irrespective of the posterior probability).

7 Differential localisation probability

As previously mentioned the term “differentially localised” is used to pertain to proteins which are assigned to different sub-cellular localisations between two conditions. For the majority of users this is the output they are keen to extract using the BANDLE method.

The differential localisation probability is also stored here, which tells us which proteins are most likely to differentially localise. The rank plot is a good visualisation. Indicating that most proteins are not differentially localised and there are a few confident differentially localised protiens of interest.

diffloc_probs <- pe1$bandle.differential.localisation
plot(diffloc_probs[order(diffloc_probs, decreasing = TRUE)],
     col = getStockcol()[3], pch = 19, ylab = "Probability", 
     xlab = "Rank", main = "Differential localisation rank plot")

We can examine the top n proteins (here 100) and produce bootstrap estimates of the uncertainty (note here the uncertainty is likely to be underestimated as we did not produce many MCMC samples). These can be visualised as ranked boxplots

set.seed(1)
bt <- bootstrapdiffLocprob(params = bandleres, top = 100)
boxplot(t(bt), col = getStockcol()[5],
        las = 2, ylab = "Probability", ylim = c(0, 1),
        main = "Differential localisation \nprobability plot (top 100 proteins)")

We see that all of these top 100 proteins have a posterior distribution of 1. We can plot the top 1000 to see how these distributions compare.

8 Visualising translocations

We can visualise the changes in localisation between conditions on an alluvial plot using the plotTranslocations function

plotTranslocations(bandleres)

Or alternatively, on a chord (circos) diagram

plotTranslocations(bandleres, type = "chord")

9 Allocation probabilities

The allocation probabilities are stored in the tagm.joint slot. These could be visualised in a heatmap

bjoint_control <- summaries(bandleres)[[1]]@bandle.joint
pheatmap(bjoint_control, cluster_cols = FALSE, color = viridis(n = 25))

bjoint_treatment <- summaries(bandleres)[[2]]@bandle.joint
pheatmap(bjoint_treatment, cluster_cols = FALSE, color = viridis(n = 25))

–> –> –> –>

10 Session information

All software and respective versions used to produce this document are listed below.

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] viridis_0.6.0        viridisLite_0.4.0    pheatmap_1.0.12     
##  [4] pRolocdata_1.29.1    ggplot2_3.3.3        bandle_1.0          
##  [7] pRoloc_1.30.0        BiocParallel_1.24.1  MLInterfaces_1.70.0 
## [10] cluster_2.1.1        annotate_1.68.0      XML_3.99-0.6        
## [13] AnnotationDbi_1.52.0 IRanges_2.24.1       MSnbase_2.16.1      
## [16] ProtGenerics_1.22.0  mzR_2.24.1           Rcpp_1.0.7          
## [19] Biobase_2.50.0       S4Vectors_0.28.1     BiocGenerics_0.36.0 
## [22] BiocStyle_2.18.1    
## 
## loaded via a namespace (and not attached):
##   [1] circlize_0.4.12       BiocFileCache_1.14.0  plyr_1.8.6           
##   [4] splines_4.0.5         digest_0.6.27         foreach_1.5.1        
##   [7] htmltools_0.5.1.1     magick_2.7.3          gdata_2.18.0         
##  [10] ggalluvial_0.12.3     fansi_0.4.2           magrittr_2.0.1       
##  [13] memoise_2.0.0         doParallel_1.0.16     mixtools_1.2.0       
##  [16] limma_3.46.0          recipes_0.1.15        gower_0.2.2          
##  [19] askpass_1.1           lpSolve_5.6.15        prettyunits_1.1.1    
##  [22] colorspace_2.0-0      ggrepel_0.9.1         blob_1.2.1           
##  [25] rappdirs_0.3.3        xfun_0.22             dplyr_1.0.5          
##  [28] crayon_1.4.1          jsonlite_1.7.2        hexbin_1.28.2        
##  [31] impute_1.64.0         survival_3.2-10       iterators_1.0.13     
##  [34] glue_1.4.2            gtable_0.3.0          ipred_0.9-11         
##  [37] zlibbioc_1.36.0       kernlab_0.9-29        shape_1.4.5          
##  [40] scales_1.1.1          vsn_3.58.0            mvtnorm_1.1-1        
##  [43] DBI_1.1.1             xtable_1.8-4          progress_1.2.2       
##  [46] bit_4.0.4             proxy_0.4-25          mclust_5.4.7         
##  [49] preprocessCore_1.52.1 lbfgs_1.2.1           lava_1.6.9           
##  [52] prodlim_2019.11.13    sampling_2.9          httr_1.4.2           
##  [55] FNN_1.1.3             RColorBrewer_1.1-2    ellipsis_0.3.1       
##  [58] farver_2.1.0          pkgconfig_2.0.3       nnet_7.3-15          
##  [61] sass_0.3.1            dbplyr_2.1.1          utf8_1.2.1           
##  [64] caret_6.0-86          tidyselect_1.1.0      rlang_0.4.10         
##  [67] reshape2_1.4.4        munsell_0.5.0         tools_4.0.5          
##  [70] LaplacesDemon_16.1.4  cachem_1.0.4          generics_0.1.0       
##  [73] RSQLite_2.2.6         evaluate_0.14         stringr_1.4.0        
##  [76] fastmap_1.1.0         mzID_1.28.0           yaml_2.2.1           
##  [79] ModelMetrics_1.2.2.2  knitr_1.33            bit64_4.0.5          
##  [82] purrr_0.3.4           randomForest_4.6-14   dendextend_1.14.0    
##  [85] ncdf4_1.17            nlme_3.1-152          xml2_1.3.2           
##  [88] biomaRt_2.46.3        compiler_4.0.5        curl_4.3             
##  [91] e1071_1.7-6           affyio_1.60.0         tibble_3.1.0         
##  [94] bslib_0.2.4           stringi_1.5.3         highr_0.8            
##  [97] lattice_0.20-41       Matrix_1.3-2          vctrs_0.3.7          
## [100] pillar_1.6.0          lifecycle_1.0.0       BiocManager_1.30.12  
## [103] GlobalOptions_0.1.2   jquerylib_0.1.3       MALDIquant_1.19.3    
## [106] data.table_1.14.0     R6_2.5.0              pcaMethods_1.82.0    
## [109] affy_1.68.0           bookdown_0.21         gridExtra_2.3        
## [112] codetools_0.2-18      MASS_7.3-53.1         gtools_3.8.2         
## [115] assertthat_0.2.1      openssl_1.4.3         withr_2.4.1          
## [118] hms_1.0.0             grid_4.0.5            rpart_4.1-15         
## [121] timeDate_3043.102     tidyr_1.1.3           coda_0.19-4          
## [124] class_7.3-18          rmarkdown_2.7         segmented_1.3-3      
## [127] pROC_1.17.0.1         lubridate_1.7.10

References

Crook, Oliver M, Colin TR Davies, Laurent Gatto, Paul DW Kirk, and Kathryn S Lilley. 2021. “Inferring Differential Subcellular Localisation in Comparative Spatial Proteomics Using BANDLE.” bioRxiv.
Gatto, Laurent, Lisa M. Breckels, Samuel Wieczorek, Thomas Burger, and Kathryn S. Lilley. 2014. “Mass-Spectrometry Based Spatial Proteomics Data Analysis Using pRoloc and pRolocdata.” Bioinformatics.
Gentleman, Robert C., Vincent J. Carey, Douglas M. Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, et al. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biol 5 (10): –80. https://doi.org/10.1186/gb-2004-5-10-r80.
Gilks, Walter R, Sylvia Richardson, and David Spiegelhalter. 1995. Markov Chain Monte Carlo in Practice. CRC press.
Mulvey, Claire M., Lisa M. Breckels, Oliver M. Crook, David J. Sanders, Andre L. R. Ribeiro, Aikaterini Geladaki, Andy Christoforou, et al. 2021. “Spatiotemporal Proteomic Profiling of the Pro-Inflammatory Response to Lipopolysaccharide in the THP-1 Human Leukaemia Cell Line.” Nature Communications 12 (1). https://doi.org/10.1038/s41467-021-26000-9.
R Development Core Team. 2011. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.